Skip to content

Fix ESP32 RSA watchdog timeout during key generation#3018

Open
valientegaston wants to merge 6 commits intotoitlang:masterfrom
valientegaston:fix-esp32-rsa-deadlock
Open

Fix ESP32 RSA watchdog timeout during key generation#3018
valientegaston wants to merge 6 commits intotoitlang:masterfrom
valientegaston:fix-esp32-rsa-deadlock

Conversation

@valientegaston
Copy link
Copy Markdown
Contributor

Problem
On ESP32, generating an RSA 2048-bit key takes longer than 10 seconds, causing the watchdog timer to kill the process, mistakenly believing it is hung.

Root cause
The mbedtls RSA operations (rsa_generate, rsa_sign, rsa_encrypt, rsa_decrypt) were passing NULL as the context to the rsa_rng callback. The scheduler watchdog monitors the run_timestamp of each process and kills it if it hasn't been updated within the timeout window. Since the RSA key generation is a long-running blocking operation, the timestamp was never updated during its execution.

Fix
The rsa_rng callback is called many times by mbedtls during RSA operations — in particular during key generation, it is called repeatedly to generate prime candidates and random bases for Miller-Rabin primality tests. This makes it a natural and low-overhead place to reset the watchdog by updating the process run_timestamp.
The fix passes the Process* as the context to all rsa_rng calls, and inside rsa_rng the scheduler mutex is taken to safely update the timestamp.

Safety
We verified that primitives execute inside an Unlocker block in Scheduler::run_process, meaning the scheduler mutex is not held during RSA operations. Therefore, taking it inside rsa_rng is safe and will not cause a deadlock.

@valientegaston
Copy link
Copy Markdown
Contributor Author

I also noticed that CONFIG_MBEDTLS_HARDWARE_MPI is explicitly set to n across all ESP32 toolchain configurations (esp32, esp32s2, esp32s3, esp32c3, esp32c6). Was this intentional? I understand it wouldn't help with key generation (as the prime generation and Miller-Rabin tests run in software regardless), but it could potentially speed up everyday RSA operations like signing, verification, and encryption/decryption.

@floitsch
Copy link
Copy Markdown
Member

floitsch commented Apr 6, 2026

Primitives shouldn't be blocking. We want to run other tasks when waiting for a primitive. As such, the scheduler revealed a real issue.
I think the best way is to run the key generation in a separate thread.
Have a look at src/event_sources/async_posix.cc and src/resources/spi_linux.cc (which uses it). With a bit of luck the same approach works on the ESP32.
It would be a bit more complicated:

  • create a "RsaGeneration" resource that contains the thread, and then do the creation of the key using that resource.

I'm guessing we set the MBEDTLS_HARDWARE_MPI to 'n' to reduce the image size. We can definitely enable it again.
I don't remember anymore if the RSA-key generation is behind an sdkconfig flag or not. I think we already have a TOIT_CRYPTO_EXTRA flag, though. So that's probably enough. The new resource should definitely be behind that one.

@valientegaston
Copy link
Copy Markdown
Contributor Author

Hi! I have implemented the RSA key generation following your suggestions:

Non-blocking: The primitives are now non-blocking. The key generation runs in a separate thread, and the Toit Task is suspended while waiting for the resource event.

RsaGeneration Resource: I created a new RsaGeneration resource (contained in StructTag as RsaGenerationResourceTag and RsaGenerationResourceGroupTag) to manage the lifecycle of the asynchronous task.

Architecture: I followed the approach in async_posix.cc, using the AsyncEventThread to coordinate the background generation and the completion event.

Flags & Configuration: The new resource and primitives are correctly guarded by the CONFIG_TOIT_CRYPTO_EXTRA sdkconfig flag as you proposed. I've also updated the sdkconfig.defaults for ESP32 targets to enable this by default.

Binary Compatibility: I made sure to append the new tags at the end of the StructTag enum to avoid any issues with existing bootstrap snapshots.

Please let me know if there's any part of the implementation you'd like me to refine further. Thanks!

- Add RsaGenerationResource_ class with finalizer/close pattern to
  ensure proper cleanup of resource group and state on exceptions
  (e.g. with-timeout).
- Add rsa-generate-close primitive to tear down the resource group.
- Fix dangling proxy in rsa_generate_finish error paths by adding
  missing clear_external_address calls.
- Add no-op Thread::cancel() for ESP32 (required since async_posix
  is now compiled for ESP32).
@floitsch
Copy link
Copy Markdown
Member

floitsch commented Apr 7, 2026

Opened a fix PR on your fork: valientegaston#1

It addresses a few resource cleanup issues:

  • The resource group and state were not cleaned up on exceptions (e.g. with-timeout wrapping the generate call). Added a RsaGenerationResource_ class with the standard finalizer/close pattern.
  • Missing resource_proxy->clear_external_address() on error paths in rsa_generate_finish (dangling pointer).
  • Missing Thread::cancel() on ESP32 (linker error since async_posix is now compiled for ESP32).

floitsch and others added 2 commits April 7, 2026 22:25
Comment thread src/primitive_crypto.cc Outdated
free(prv); free(pub);
} else {
// mbedtls writes from the end. Move to start.
unsigned char* prv_start = (unsigned char*)malloc(prv_len);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we don't have enough memory here, we would have to throw away everything we did so far and start again.
Also: I'm not even sure things would behave correctly: By marking the error as MBEDTLS_ERR_PK_ALLOC_FAILED the rsa_generate_finish would return a "malloc failed". This would indicate to the system that the primitive ran out of memory and that we should do a GC, then try again. However, the rsa_generate_finish would just fail now as the resource_proxy just has been cleared.

-> We need to allocate the memory in the main-thread (that has called the primitive) and not inside the event-thread. Unfortunately, this means that we need to know the prv_len und pub_len already before we actually call mbedtls_pk_write_key_der.

That said: it looks to me like a resize is the better option anyway. The prv_len/pub_len just seem to say how much of the given buffer was written too.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced the two extra malloc calls with memmove on the same buffer to shift the data to the start, followed by realloc to shrink it to the exact size. If realloc fails it returns null without altering the original block, so the original pointer is preserved and passed to set_results correctly — no memory leak or dangling pointer.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same still applies to the 'malloc' in line 986/987. (the ret goes to the rsa_generate_finish, leading to an unrecoverable OOM). The mallocs need to be done in the main thread. Here we just use constants (RSA_PRV_DER_MAX_BYTES), so that's relatively easy to fix.

Comment thread src/scheduler.h Outdated
bool is_locked() const { return OS::is_locked(mutex_); }
bool is_boot_process(Process* process) const { return boot_process_ == process; }

Mutex* mutex() const { return mutex_; }
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Necessary?
If yes, we should try to find a better approach than to make the scheduler's mutex public.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer necessary. In the previous approach, the rsa_rng callback needed to take the scheduler mutex to update the process run_timestamp. With the new async design, the Toit process is suspended during key generation, so there's no need to update the timestamp from the RNG callback at all. The mutex() accessor has been removed.

CONFIG_MBEDTLS_X509_TRUSTED_CERT_CALLBACK=y
CONFIG_MBEDTLS_ECP_RESTARTABLE=y
CONFIG_MBEDTLS_HARDWARE_MPI=n
CONFIG_MBEDTLS_HARDWARE_MPI=y
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any downside to enabling MPI?

For some esp32 variants we were fighting the partition sizes already. So if this increases the image size, we need to make sure we aren't too big now.

Copy link
Copy Markdown
Contributor Author

@valientegaston valientegaston Apr 18, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I measured the binary size impact on ESP32:

  • Without HARDWARE_MPI: 0x1561f0 bytes (18% free of 0x1a0000 partition)
  • With HARDWARE_MPI: 0x156390 bytes (18% free of 0x1a0000 partition)

The difference is only ~416 bytes.

I only tested on ESP32 so far — still checking the other variants.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ESP32S3:

Without HARDWARE_MPI: 0x14d5c0 bytes (20% free)
With HARDWARE_MPI: 0x14d840 bytes (20% free)
Difference: ~640 bytes

Still checking the remaining variants (esp32c3, esp32c6, esp32s2).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ESP32S2:

Without HARDWARE_MPI: 0x116a50 bytes (33% free)
With HARDWARE_MPI: 0x116ce0 bytes (33% free)
Difference: ~656 bytes

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ESP32C3:

Without HARDWARE_MPI: 0x163e00 bytes (14% free)
With HARDWARE_MPI: 0x164560 bytes (14% free)
Difference: ~1888 bytes

ESP32C6:

Without HARDWARE_MPI: 0x1818c0 bytes (11% free)
With HARDWARE_MPI: 0x182020 bytes (11% free)
Difference: ~1888 bytes

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks.

@floitsch
Copy link
Copy Markdown
Member

@valientegaston do you think you can address the feedback?

@valientegaston
Copy link
Copy Markdown
Contributor Author

Hi! Yes, I'll be addressing the feedback in the coming days. We've been quite busy lately but I'll make sure to go through all the comments and make the necessary changes. Thanks for your patience!

Copy link
Copy Markdown
Member

@floitsch floitsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still some subtle issues.
Not completely sure what the best way for the resource handling is.
Maybe one way is to keep track of whether the parallel thread is running and store that in the resource. When the resource gets the signal to destroy itself, it could then delay that action until the thread returns. Alternatively (if possible) killing the thread first, would also work.
We do something slightly similar in the BLE code.

Comment thread src/primitive_crypto.cc
}

if (ret == 0) {
unsigned char* prv = (unsigned char*)malloc(RSA_PRV_DER_MAX_BYTES);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When do we free the memory again? Shouldn't that happen in rsa_generate_finish?
But that's more complicated:

  • what if nothing ever calls 'rsa_generate_finish'? This means that the memory needs to be attached to the resource.
  • However, that immediately reveals another race condition (which was already present): what if the program stops while we are running in parallel. The res resource could already be freed at that point.

Comment thread src/primitive_crypto.cc Outdated
free(prv); free(pub);
} else {
// mbedtls writes from the end. Move to start.
unsigned char* prv_start = (unsigned char*)malloc(prv_len);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same still applies to the 'malloc' in line 986/987. (the ret goes to the rsa_generate_finish, leading to an unrecoverable OOM). The mallocs need to be done in the main thread. Here we just use constants (RSA_PRV_DER_MAX_BYTES), so that's relatively easy to fix.

Comment thread src/primitive_crypto.cc
memmove(prv, prv + RSA_PRV_DER_MAX_BYTES - prv_len, prv_len);
memmove(pub, pub + RSA_PUB_DER_MAX_BYTES - pub_len, pub_len);

unsigned char* prv_resized = (unsigned char*)realloc(prv, prv_len);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add the following information:
RSA_PRV_DER_MAX_BYTES is typically ~550 bytes.
Typically, keys only need ~300 bytes.
As such the realloc is useful.
However, if it fails, it's not the end of the world either.

Comment thread src/primitive_crypto.cc
}

if (ret == 0) {
unsigned char* prv = (unsigned char*)malloc(RSA_PRV_DER_MAX_BYTES);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use unvoid_cast instead of C-style casts. Here and in all the other places.

Comment thread src/primitive_crypto.cc
}

ByteArray* prv_der = process->allocate_byte_array(resource->prv_len());
ByteArray* pub_der = process->allocate_byte_array(resource->pub_len());
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another option is to use an external ByteArray and use the existing buffers. That would decrease the peak amount of memory.
I think for 250+ bytes it's ok to have an external byte-array.
However, at that point I would do another realloc, just in case the one in the other thread didn't succeed. If it still doesn't succeed, you can do a MALLOC_FAILED.

Comment thread src/primitive_crypto.cc
Comment on lines +1039 to +1040
resource->resource_group()->unregister_resource(resource);
resource_proxy->clear_external_address();
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bit too eager.
For OOMs the primitive will do a GC and then try again. If you destroy the resource the second attempt wouldn't work anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants